Basic changes in the way people use and program
distributed systems are on the horizon. These
changes range from new low-level programming
techniques that make it easy to build
fault-tolerant distributed applications that
exploit replication and concurrent execution to a
meta-operating system with services spanning large
numbers of machines in heterogeneous networks.

A Programming Revolution 

Users of distributed-computing systems rapidly
discover how similar such systems are to the
timeshared machines of the 1970s: The pervasive
use of “network transparency” techniques largely
conceals the fact of distribution. For example,
the dominant distributed programming technology,
remote procedure call (RPC), permits a program
running on one machine to invoke a procedure
residing in some other program. Given adequate
language support, an RPC interface can hide many
details of message-based interaction and
connection management from a user. The idea of
transparency also extends to other parts of a
typical distributed system. Using a filesystem
such as NFS, a program can operate on files that
physically reside on a remote machine in the
network using exactly the same interface as for
local files. If the distributed-computing
revolution is under way, its impact on how
programs are written has been minor.

This lack of impact is troubling, especially in
light of the many reasons why distributed
computing should be different from nondistributed
programming. Parallel computing is certainly
analogous to distributed computing in many
respects. Yet, whereas the effective use of
parallel machines has led to fundamentally new
programming technologies such as Linda 1 and
CSP, 2 the same thing has not happened in the case
of distributed programs. If distributed systems
are built using technologies that proved to be
unsatisfactory in parallel settings, then
distributed systems are probably making
ineffective use of concurrency (parallelism).

The requirements of a distributed application may
go beyond those of a timeshared or parallel
program. Not only does a distributed computer
system need to exploit concurrency, but it may
also need to remain operational in the presence of
“partial” failures, that is, situations in which
one of the machines connected to a network fails
or becomes partitioned from the others, while the
majority of the machines remain operational and
must reconfigure themselves and continue
executing. The complementary problem of
reintegrating a recovered machine into an online
system also arises.

From this perspective, transparency may not be
such a tremendous win. RPC is a pairwise
programming methodology: although RPCs can nest,
more complex interactions are not normally
constructed from RPCs. A transparent RPC mechanism
offers little in applications that require
concurrent action by more than two processes at a
time, especially if those processes must cooperate
but are not controlled by a common ancestor.
Moreover, most RPC mechanisms handle failures by
either timing out or by retrying a request several
times, at best providing some form of “at most
once” guarantees. This is not a sophisticated way
of reacting to a failure.

The problem is that the gap between these
mechanisms and a coordinated algorithm (whereby a
set of processes joins forces to solve a problem
in a fault-tolerant manner) is simply too large
for the average programmer to bridge. The
unfortunate user whose distributed problem doesn’t
fit into these paradigms must undertake a complex
and costly system-development effort or abandon a
distributed solution entirely.

One approach to these problems is to augment RPC
with transactional features oriented toward
controlling concurrency and ensuring that
persistent objects can recover their states after
failures. Prominent among efforts to do this are
the Argus system, which focuses on language
issues, 3 and the Camelot system, which focuses on
performance. 4 These sorts of “better behaved” RPC
mechanisms will doubtless play a major role in the
distributed systems of the future. On the other
hand, they are not resulting in fundamentally new
strategies for exploiting the network, and after a
10-year development period, their most important
role has been in creating and managing
special-purpose databases.

The major premise of the authors’ project, ISIS,
is that when a distributed system is viewed as a
timeshared system or encourages its users to
program as if their application were running on a
dedicated idle system, as with transactional RPC,
the most powerful resource that a distributed
system offers us is lost: distribution itself. We
lose the ability to employ a set of processes in a
coordinated, cooperative attack on a problem. We
lose the ability to apply highly adaptive,
reconfigurable solutions to applications that must
remain online in the presence of failures and
recoveries. And we lose the possibility of
building a distributed system that is more fault
tolerant and offers higher performance than any of
its components.

As we enter the 1990s, the state of the art in
distributed computing embodies a paradox. On the
one hand, networks are becoming ubiquitous and can
be used and programmed much like the timeshared
processors of the 1960s and 1970s. On the other
hand, the proliferation of networks has yet to
result in any sweeping changes in the way we
develop software. Furthermore, a large class of
applications seems to lie out of reach: those that
require direct coordination among a set of
processes, replicated data, parallel execution of
requests, or a coordinated response to a failure
or some other reconfiguration event. The
techniques that have helped achieve networked
computing offer no easy solutions to these
intrinsically distributed applications.

A new programming technology now promises to open
the door to solving these applications: the
ISIS 5 Programming Toolkit. The ISIS Toolkit is a
“low level” programming technology. It can change
the way distributed systems are built, but it will
not directly change the available higher level
services of distributed operating systems.
Concurrent with the ISIS Toolkit is the META
project, which is reexamining high-level
mechanisms taken for granted in distributed
systems: the filesystem, the shell language, and
the administration tools.

ISIS and META, together with other technologies,
herald basic changes in the way we think about and
use computer networks. Rather than viewing
networks primarily as a way to connect a program
running in one place with a resource that “lives”
someplace else, these technologies permit
distributed software design that makes explicit
use of the distributed character of the network
environment. Although no one system addresses the
whole spectrum of distributed requirements, taken
as a composite they offer a sweeping range of new
and powerful ways to approach distributed
problems.

The ISIS Distributed-Computing Model 

Prior to a detailed review of the ISIS toolkit, an
understanding of some of the programming
structures of ISIS is helpful. Like most
distributed systems, ISIS is based on processes
and messages. Our notion of a process is the basic
UNIX one: each execution of a program gives rise
to a process, that is, an address space containing
one or more lightweight tasks (also called threads
of control or lightweight processes). In ISIS,
each arriving message is handled by a separate
task. Although task execution is FIFO and
nonpreemptive, tasks can explicitly wait on and
signal condition variables when desired. (For more
on lightweight tasks, refer to The ISIS
Programming Manual and User's Guide. 6)

ISIS assumes that both processes and the
communication system can fail. ISIS is limited in
the kinds of failures it can handle. It tolerates
communication failures that involve lost messages,
but it may hang if communication is completely
disrupted between sets of sites by a network
partitioning. 7

ISIS also supports reconfiguration and continued
execution after crash failures, whereby programs
or machines simply stop executing (most software
failures result in crashes of this sort). It is of
limited value in a system subject to more extreme
crash modes, such as when a software bug causes a
program to go into an infinite loop or to send
messages containing incorrect data. Because ISIS
has no way to distinguish these behaviors from
correct ones, neither condition can be detected.
Yet, if the application designer can detect such a
condition, ISIS does offer tools to overcome it.

Communication among ISIS processes is by messages.
These contain streams of typed data items, which
are added to the message by use of format strings
(shown below). Because ISIS knows the type of each
data item, the reception side handles byte-order
conversions automatically.

Figure 1. A stockbroker's trading system.
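The byte-order handling mentioned above can be pictured with a small sketch. This is not ISIS code: the helper names put_u32 and get_u32 are hypothetical, and the choice of a big-endian wire order is an assumption made only for illustration. The point is that once the type of an item is known, it can be written in one fixed order and decoded correctly on any machine.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of ISIS: store a 32-bit value in a
   fixed (big-endian) wire order so that any receiver can decode it
   regardless of its native byte order. */
void put_u32(unsigned char *buf, uint32_t v)
{
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)(v >> 16);
    buf[2] = (unsigned char)(v >> 8);
    buf[3] = (unsigned char)v;
}

/* Rebuild the value from the wire bytes using shifts only, so the
   result does not depend on the host's byte order. */
uint32_t get_u32(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}
```

A sender would call put_u32 for each integer field and the receiver get_u32. A toolkit that knows the item types can perform this step on the application's behalf.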

ISIS also supports virtually synchronous process
groups. These groups consist of a set of processes
that are cooperating for “some purpose,” be it
distributed execution of requests, management of
replicated data, or whatever. A process can belong
to many process groups, and the members of any
particular group need not be identical or even be
programmed in the same language, although they are
expected to use compatible group-management
algorithms. Each group has a symbolic pathname and
a unique 24-byte address that can be used to
communicate with it. Group addresses and process
addresses can be used interchangeably throughout
ISIS; when a message is sent to a group address,
the system expands this into a reliable
broadcast 8 to the current membership of the
group.

Before saying more about the notion of virtual
synchrony, or even what it means for a broadcast
to be reliable, we look at some typical ways that
ISIS applications use process groups. To make the
example concrete, consider a stockbroker’s trading
system composed of three types of entities (see
Figure 1). At the front end, the system has
workstations that display current quotes and
trading advice. These employ an interactive
command interface. Connected to the system are
“ticker” devices, the computer-readable analog to
the mechanical stock tickers used in the past. In
this system, tickers are redundant because the
risk of a failure must be kept to a minimum. (For
simplicity, the process that handles a given
ticker device is referred to as a ticker.)


The system also provides a variety of analysis 
services capable of searching databases for infor- 
mation needed by the trader, calculating suggest- 
ed buy and sell margins based on trend analyses, 
comparing options and futures prices with cur- 
rent quotes, and other tasks. 

A system such as this could use ISIS process
groups in several ways. At the front end, the set
of stocks that any given broker is monitoring will
likely vary over time, perhaps quickly in modern
program trading. If a ticker process receives a
new quote for Sun, how is it to know which
workstations currently need this information?

An easy solution is to create a process group for
each stock currently being monitored. All programs
wanting quotes for that stock would join the
group. A ticker would then broadcast quotes to the
appropriate process groups and leave ISIS to cope
with their dynamically changing membership (see
Figure 2).
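The group-per-stock idea can be illustrated with a single-process sketch. Everything here is a stand-in: struct group and the group_* functions are hypothetical names, not ISIS calls, and the "broadcast" merely copies out the list of current recipients. What the sketch does show is that the sender never names its recipients; it addresses the group, and membership decides who receives the quote.

```c
#include <string.h>

#define MAX_MEMBERS 16

/* Hypothetical in-memory stand-in for one process group: the group is
   named after a stock and holds the ids of interested workstations. */
struct group {
    char name[16];
    int  members[MAX_MEMBERS];   /* kept in join order */
    int  nmembers;
};

void group_init(struct group *g, const char *name)
{
    strncpy(g->name, name, sizeof g->name - 1);
    g->name[sizeof g->name - 1] = '\0';
    g->nmembers = 0;
}

/* Returns 0 on success, -1 if the group is full or id already joined. */
int group_join(struct group *g, int id)
{
    int i;
    for (i = 0; i < g->nmembers; i++)
        if (g->members[i] == id)
            return -1;
    if (g->nmembers == MAX_MEMBERS)
        return -1;
    g->members[g->nmembers++] = id;
    return 0;
}

/* Returns 0 on success, -1 if id was not a member. */
int group_leave(struct group *g, int id)
{
    int i;
    for (i = 0; i < g->nmembers; i++) {
        if (g->members[i] == id) {
            /* close the gap, preserving the join order of the rest */
            memmove(&g->members[i], &g->members[i + 1],
                    (size_t)(g->nmembers - i - 1) * sizeof g->members[0]);
            g->nmembers--;
            return 0;
        }
    }
    return -1;
}

/* "Broadcast": copy the ids of the current members into recipients
   (which must hold MAX_MEMBERS entries) and return how many there are. */
int group_bcast(const struct group *g, int *recipients)
{
    memcpy(recipients, g->members,
           (size_t)g->nmembers * sizeof g->members[0]);
    return g->nmembers;
}
```

A ticker in this sketch would call group_bcast on the group for “SUN” each time a Sun quote arrives, without tracking which workstations have joined or left in the meantime.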

Remember that for reasons of fault tolerance, 
ticker processes are redundant. A failure might be 
due to the crash of a ticker, or it might be due to a 
“softer” problem such as a transient overload or a 
burst of line noise that garbles a quote. Ticker 
processes can also be redundant for purposes of 
load sharing. Even in the absence of failures, a 
single ticker cannot practically deal with all 
quotes on behalf of all brokers in a large trading 
room. Forming the ticker processes into a process 
group makes programming the necessary control 
algorithm easy. 

Figure 2. Tickers must deal with a dynamically
changing clientele.

ISIS permits users to assign responsibility based
on the initial letter of a stock’s name in a
manner that rules out confusion about which ticker
handles which stocks; that is, there is no risk of
having two tickers watching stocks starting with
the letter A and none watching stocks starting
with the letter C. After a system crash or
recovery, ISIS provides a reconfiguration method,
in this case reassigning the letters to ensure
full coverage. It also provides a way to cope with
tickers that garble or miss individual quotes and
to retransmit quotes that arrive between the time
when a ticker fails and when reconfiguration takes
place. One could implement a redundant assignment
rule, for example, by arranging for two tickers to
handle each stock, as shown in Figure 3. Although
quotes arrive in duplicate and stale quotes must
be discarded, when a failure occurs, quotes keep
flowing. Two nearly simultaneous failures would
have to occur to prevent a broker from obtaining
timely information.
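With two tickers covering each stock, a workstation must discard the second copy of every quote. A minimal sketch of such a filter follows. It rests on an assumption the article does not make: that each quote carries a sequence number assigned at the wire source, so both tickers forward the same number for the same quote. The names quote_filter, filter_init, and filter_accept are hypothetical.

```c
/* Hypothetical duplicate filter for one stock. Assumes each quote
   carries a sequence number assigned at the wire source, which the
   article does not specify. */
struct quote_filter {
    long last_seq;   /* highest sequence number accepted so far */
};

void filter_init(struct quote_filter *f)
{
    f->last_seq = -1;
}

/* Returns 1 if the quote is new and should be displayed, 0 if it is a
   duplicate or a stale copy from the redundant ticker. */
int filter_accept(struct quote_filter *f, long seq)
{
    if (seq <= f->last_seq)
        return 0;
    f->last_seq = seq;
    return 1;
}
```

The filter keeps one number per stock, so the cost of redundancy at the receiver is a single comparison per arriving quote.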

The same techniques that can be used to assign
stocks to tickers can also be used to subdivide
other types of computations. For example, a
complex database search can be divided into parts
that each member of a set of servers will perform
independently (merging the results at the end).
The same mechanisms can also be used for an
analysis that requires opinions from multiple
“experts.”

Another area in which ISIS offers sophisticated
support is in the use of replicated data. Process
group members can easily maintain replicated data
structures, updating them at extremely low cost
and permitting direct read access, much as with an
accurate cache.

Replicated data is an example of a group state
that changes dynamically. Many servers need to
maintain dynamic state information, be it
replicated data, descriptions of pending requests,
or lists of currently held locks. When a process
recovers and needs to join an operational system,
transferring this information poses a thorny
problem. Because the information changes in
response to some types of events, one must lock
out those kinds of events while copying the state
of the group to the new member. Clearly, the
transfer must be fault tolerant.

Overall, these factors add up to the kind of
problem most programmers would find hard to solve.
ISIS, however, has a group-join mechanism that
automates the task. The programmer supplies a
routine to transfer the group state out to the new
member and a routine to receive and unpack the
state when it arrives. ISIS correctly synchronizes
the “join” and handles failures. Clients using the
group will usually be unaware that a new member
joined while they were talking to the group. The
method works well for states of up to a few
hundred kilobytes in size, which is enough for
most database purposes (additional mechanisms for
groups managing much larger states are now being
designed).

What would ISIS not be useful for? One major area
is transactional database and file management.
Some powerful systems for this task are available,
and ISIS was designed to avoid duplicating these
efforts. As a result, although ISIS has powerful
synchronization mechanisms, it is not oriented
toward serialization (a widely accepted technique
for maintaining database correctness) or atomic
transactions (the usual technique for database
crash recovery). Instead, ISIS focuses on
cooperative distributed algorithms and on the
volatile and rapidly changing state of a
distributed computation.

Figure 3. Using redundancy to obtain
fault-tolerant quote dissemination.

Virtual Synchrony 

ISIS is not the first system to use process
groups, but its process group mechanism is
unusually powerful. The reason for this power is a
theoretical advance called virtual synchrony.

When reading about the various schemes for
subdividing work among a set of tickers or a
collection of expert subsystems, you may have
wondered how these schemes can possibly be correct
when failures and recoveries occur. Certainly,
almost any algorithm that uses RPCs for
interactions between the ticker processes and
time-outs to detect failures will be complex and
prone to errors. The risk of ending up in a state
in which no process sends quotes for IBM stock
seems very real (for example, one process covers
A-H, another J-N, but no process covers I). This
situation occurs if processes have inconsistent
views of one another’s status. And such an
inconsistency can easily arise because transient
phenomena and overloads can mimic failures in
networks of workstations. Worse, failures and
other events might be observed in different orders
by different processes in a distributed system.
One can imagine an algorithm that behaves
differently on detecting a given event in one
state than in another, and that changes state in
response to events it observes. Two instances of
this algorithm might not give the same behavior
even when executed on the same events but in
different orders. If one treated a transient
overload as a failure but the other did not,
inconsistency would certainly arise.
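The order-dependence problem can be made concrete with a toy failure detector. The rule used here (two time-outs in a row mean “failed,” a reply clears suspicion) is a hypothetical example chosen for illustration, not a rule taken from ISIS. Two observers that see the same three events in different orders end up disagreeing about whether the peer has failed.

```c
enum event  { EV_TIMEOUT, EV_REPLY };
enum status { ST_UP, ST_SUSPECT, ST_FAILED };

/* Hypothetical failure detector: a time-out makes a peer suspect, a
   second time-out while suspect marks it failed, and a reply clears
   suspicion. A peer marked failed stays failed. */
enum status step(enum status s, enum event e)
{
    if (s == ST_FAILED)
        return ST_FAILED;
    if (e == EV_REPLY)
        return ST_UP;
    return (s == ST_UP) ? ST_SUSPECT : ST_FAILED;
}

/* Apply n events in the order given, starting from ST_UP. */
enum status run(const enum event *ev, int n)
{
    enum status s = ST_UP;
    int i;
    for (i = 0; i < n; i++)
        s = step(s, ev[i]);
    return s;
}
```

An observer that sees time-out, reply, time-out ends in ST_SUSPECT, while one that sees time-out, time-out, reply ends in ST_FAILED. Each observer behaves correctly by its own rule, yet the two now hold inconsistent views of the same peer.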

To take an extreme example from a different
setting, consider a factory that produces some
sort of chemical and whose production strategy
changes if some critical valve is not responsive.
A programmer might decide to decentralize control
among a set of control programs to gain increased
fault tolerance and benefit from load sharing.
Should the programmer not now be worried that one
component of the control subsystem could
incorrectly conclude that the valve has jammed,
perhaps because of a communication failure (and
hence switch to the emergency shutdown procedure),
while the remainder of the system continues normal
operation, unaware of what has happened? Clearly,
correct behavior in each of the components of a
system does not automatically imply mutually
correct behavior of the system as a whole.

In light of these sorts of problems, how can one
be sure that a rule like the one we proposed for
controlling the set of tickers will operate
correctly in other applications? How is it
possible to decentralize a factory control system
with confidence that it will still function
correctly?

Virtual synchrony efficiently solves this class of
problems. It starts with a simplified model for
how a distributed system executes, and the program
is coded as if this model were realistic. Then,
much as a compiler may produce code that differs
from the original program without changing its
behavior, the distributed program is executed in
ways that improve efficiency and permit it to run
in a realistic environment while preserving
correctness. Because the program is written for a
setting that differs from the one in which it
runs, ISIS supports a virtual environment.

The idealized environment of ISIS is illustrated
by the distribution of quotes to a set of
application programs shown in Figure 4. Here, time
advances from top to bottom, with one distributed
event occurring per time interval. The types of
events shown include broadcasts of quotes,
failures, and joins. Notice that a broadcast to a
group is always delivered to all the current
members at once. Similarly, all group members see
failures in unison.

A synchronous system can be inherently costly and
scale poorly. To avoid this problem, ISIS runs
programs designed for synchronous environments in
a much more concurrent manner by relaxing any
kinds of synchronization that the algorithm
doesn’t really depend upon. For example, if a
synchronous algorithm doesn’t look at a realtime
clock, it will execute correctly even in a
“loosely” synchronous setting, where event
orderings are the same as for a synchronous
environment, but the actual time at which events
occur or messages are delivered can differ from
process to process. A virtually synchronous
environment goes further: in many cases, it
presents events in different orders to different
processes, provided that they all behave
indistinguishably from some synchronous execution
(see Figure 5).

Figure 4. In a synchronous execution, only one
event can happen at a time. However, a single
event can occur (or be observed) at multiple
places, as with the delivery of a broadcast or the
detection of a failure.

Figure 5. A virtually synchronous execution is
indistinguishable to the application from a
synchronous one. Event orderings may differ from
process to process but only if behavior of the
application is unaffected.

96 Sun Technology

Summer 1989

The idea of virtual synchrony is rooted in
database and distributed systems theory. 10 One
reason why the idea was not applied to distributed
systems sooner is that it needs a careful analysis
of the ordering requirements of the application
being run, and lacking such an analysis,
performance is certain to be poor. ISIS
applications interact through our toolkit,
however, so the toolkit algorithms can be analyzed
and optimized, and this benefits ISIS applications
as a whole. 11

In practical terms, virtual synchrony means that
ISIS applications execute in a simplified
environment in which a layer of software hides
many of the difficulties that make distributed
programming so hard. All the ISIS tools have
simple interfaces and simple behavioral
descriptions that include failure cases. User code
carries little risk of unpleasant surprises such
as race conditions, inconsistent views of the
status (failed or operational) of processes, or
ordering anomalies that can lead to inconsistent
behaviors in different parts of a system.

A brief review of the ISIS tools, with code
fragments as examples, makes the idea of virtual
synchrony more concrete. All ISIS tools can be
used from C, Lisp, and FORTRAN and, if desired, in
conjunction with other mechanisms such as Sun
tools or NeWS and X Windows.

Messages 

ISIS defines a new type of object called a
message. ISIS uses messages in various ways. A
message is created and manipulated much as an
input/output stream is. The sequence in Figure 6
creates a message and then scans the contents into
variables stock, date, time, and quote. Notice the
similarity to fprintf and fscanf. Format items may
specify base types (as above), variable length
arrays, or user-defined structures.
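The analogy with fprintf and fscanf can be made runnable using only the standard C library. The sketch below is not the ISIS message format: quote_pack and quote_unpack are hypothetical names, and a text buffer stands in for the message. It packs the same four items (stock, date, time, quote) with a format string and scans them back out.

```c
#include <stdio.h>

/* Stand-in for building a message body: format four typed items into
   a text buffer. Returns the number of characters written (excluding
   the terminator), as snprintf does. */
int quote_pack(char *buf, size_t len, const char *stock,
               const char *date, int time, double quote)
{
    return snprintf(buf, len, "%s, %s %d, %f", stock, date, time, quote);
}

/* Stand-in for scanning a message: recover the four items. The stock
   name is read with a scanset so that the comma is not swallowed.
   Returns the number of fields recovered (4 on success). */
int quote_unpack(const char *buf, char stock[16], char date[16],
                 int *time, double *quote)
{
    return sscanf(buf, "%15[^,], %15s %d, %lf", stock, date, time, quote);
}
```

Packing the items "SUN", "3/17/89", 1022, and 162.5 and then unpacking the buffer recovers the same four values. A typed message layer differs in that it records each item's type alongside the data, which is what allows conversions such as byte-order fixes to happen on the receiving side.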

The most common thing to do with a message is to
send it to an entry point defined by another
process. Rather than require three steps for this
process (generate, transmit, deallocate), many
ISIS system calls combine all of these steps into
a single action (see below).

message *mp;

mp = msg_gen("%s, %s %d, %f", "SUN", "3/17/89", 1022, 162.5);
msg_get(mp, "%s, %s %d, %f", &stock, &date, &time, &quote);

Figure 6. Creating a message and scanning the
contents into stock, date, time, and quote
variables.

Joining Process Groups
and Obtaining Group Views

A process uses the pg_join request to join a
process group. As Figure 7 shows, several sorts of
options can be specified. Here, the calling
process requests that it be added to the process
group named /analysis/technology. The group
resides in a group of sites called “New York”
(site groups are somewhat like process groups,
although less dynamic). The handling of the join
depends on whether or not the group is already
operational within this group of sites. If it is,
the credentials string is used to validate
permission of the new process to join. A state
transfer is then initiated by invocation of the
state transfer “out” routine in some operational
group member. This encodes the state into one or
more messages and then transmits them by calling
an ISIS-supplied xfer_out routine; for each call,
the corresponding “in” routine (here, gstate_in)
will be invoked remotely. Figure 8 illustrates
this.

gaddr = pg_join("@NewYork:/analysis/technology",
                PG_CREDENTIALS, "signature",
                PG_XFER, gstate_out, gstate_in,
                PG_MONITOR, gstate_mon,
                PG_INIT, gstate_init,
                PG_LOGGED, TRUE,
                0);

Figure 7. Specifying options to join a process
group.
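The pairing of the “out” and “in” routines during a state transfer can be sketched in a single process. Nothing below is the ISIS interface: state_out, state_in, struct xfer_sink, and the tiny CHUNK size are all hypothetical, and a direct function call stands in for the transmission that ISIS would perform. The sketch shows only the shape of the exchange: the sender cuts its state into one or more pieces, and the receiver's routine runs once per piece.

```c
#include <string.h>

#define CHUNK 8   /* deliberately tiny so the example needs several pieces */

/* Hypothetical receiver-side buffer filled by the "in" routine. */
struct xfer_sink {
    char   data[64];
    size_t used;
};

/* "In" routine: called once per arriving piece; appends it to the
   sink. Returns 0 on success, -1 if the piece would overflow. */
int state_in(struct xfer_sink *sink, const char *piece, size_t n)
{
    if (n > sizeof sink->data - sink->used)
        return -1;
    memcpy(sink->data + sink->used, piece, n);
    sink->used += n;
    return 0;
}

/* "Out" routine: splits the state into pieces of at most CHUNK bytes
   and hands each one to state_in, which stands in for transmission.
   Returns the number of pieces sent, or -1 on failure. */
int state_out(const char *state, size_t len, struct xfer_sink *sink)
{
    size_t off = 0;
    int sent = 0;

    while (off < len) {
        size_t n = (len - off < CHUNK) ? len - off : CHUNK;
        if (state_in(sink, state + off, n) != 0)
            return -1;
        off += n;
        sent++;
    }
    return sent;
}
```

In a real transfer the hard part is what this sketch omits: holding back events that would change the state while it is being copied, and finishing the transfer correctly if the sending member fails partway through.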

The pg_join routine behaves differently if the 
group is not already running. In this case, ISIS 
creates a new instance of the group from scratch. 

If the group state is not logged (controlled by 
PG_LOGGED), or if this is the first time the ap- 
plication has ever been started, ISIS calls the 
PG_INIT routine. Otherwise, if the member is 
one of the last to have failed in the old group, the 
group state is rolled in from a log automatically 
maintained by ISIS on behalf of the member. If 
the log is out of date, the joining process must wait 
until the last members to fail recover. Logging is 
not the default and is not likely to be common in 
ISIS. 

Finally, ISIS posts a monitor. Each group mem- 
bership change is reported to the routine 
gstate_mon, and if all group members monitor 
the membership, the changes are synchronous. 

Although joining is a multistage algorithm, it
looks to the outside world like an instantaneous
event (see Figure 9). For example, no broadcast
ever reaches some group members before a join and
others after it, so group members can use the
group-membership “view” as part of the algorithm
for deciding how to behave. (Remember the ticker
example?) This data structure lists the current
members in the order they joined. Group membership
lists change one by one, and all members see the
same changes and in the same order with respect to
other types of events.

Figure 8. State transfer to a new member of a
process group.

Figure 9. Group joins appear instantaneous to the
outside world.

Broadcasts and Replies 

To broadcast to a process group, a program first
looks up its group address and then sends the
desired message. If a reply is needed, the caller
can wait for one, all, n, or a majority of
replies. The general format is:

bcast(addr, entry, out_fmt, out_data, n_replies,
      input_format, &replies);

The sender specifies the address of the
destination process or group, the message to send,
the number of replies desired, and (if nonzero)
the reply format and variables into which the
replies should be scanned. Notice that bcast
combines the interface of msg_gen with that of
msg_get. The ticker process might include code
like that in Figure 10, in which the set of
clients interested in Sun quotes is represented by
a process group and, if the group is not empty,
the new quote is transmitted asynchronously
(without waiting for delivery to occur). NEW_QUOTE
is an entry number. The idea is that each service
exports (as part of its interface definition) the
entry numbers of services it provides, much like
the procedure identifiers used in an RPC protocol.
Users of the service identify their request by
specifying an entry number, as shown in Figure 10.
Processes that have the service tell ISIS what
routine to call when messages of this sort are
received. They use:

isis_entry(NEW_QUOTE, new_quote, "got quote");

When the quote arrives, ISIS invokes the desig- 
nated procedure as a new lightweight task: 

new_quote(mp)
message *mp;
{
    msg_get(mp, "%s: %s %d, %d", ...);
}

If needed, a reply can be sent like the original
request. ISIS also provides facilities for
determining who sent a message, forwarding a
message, dealing with failures that occur while
waiting for responses, and handling similar
situations.
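The choice among waiting for one, all, n, or a majority of replies reduces to a small calculation over the current group size. The sketch below shows that calculation under stated assumptions: the constants WANT_ALL and WANT_MAJORITY and the function replies_needed are hypothetical and are not the symbols ISIS itself defines, and capping a requested count at the group size is a design choice made here, not documented behavior.

```c
/* Hypothetical reply-count modes; the constants ISIS defines for this
   purpose may differ. */
#define WANT_ALL      (-1)
#define WANT_MAJORITY (-2)

/* Number of replies a caller should wait for, given the requested
   count (0, a positive n, WANT_ALL, or WANT_MAJORITY) and the current
   group size. A positive n is capped at the group size, since no more
   than that many members can answer. */
int replies_needed(int want, int group_size)
{
    if (group_size <= 0)
        return 0;
    if (want == WANT_ALL)
        return group_size;
    if (want == WANT_MAJORITY)
        return group_size / 2 + 1;
    if (want < 0)
        return 0;               /* unknown mode: wait for nothing */
    return (want < group_size) ? want : group_size;
}
```

For a group of five, a majority is three replies; for a group of four it is also three. Because membership can change while the caller waits, a real implementation must also reduce the count when a member that has not yet replied fails, which is one of the failure-handling facilities mentioned above.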

Replicated Data, Synchronization, 
and Distributed Computations 

ISIS can help a program maintain replicated data,
using a broadcast to update it but reading data
locally. If desired, a synchronization mechanism
based on tokens or locks can be used. Our method
is fault tolerant and has a roll forward recovery
scheme when failures occur. More important, it is
asynchronous: No process ever blocks when doing a
read or an update or releasing a lock (blocking
when a lock is requested is obviously
unavoidable). This means that replicated data in
ISIS costs little more than unreplicated data,
provided that the network bandwidth can
accommodate the background message traffic
generated. In practical terms, ISIS V1.2 can
update data replicated among 5 processes on
separate Sun-3/60 workstations at a rate of 50-100
updates per second.

ISIS has several choices for distributing a
computation across the members of a group. A
coordinator-cohort scheme selects some member to
be responsible for a request. Noncoordinator
processes function as passive backups, taking no
action unless a failure occurs (if replicated data
is updated, the coordinator broadcasts its
changes).

The approach handles load sharing by permitting
multiple coordinators to run simultaneously in
different processes when several requests are
pending. A redundant computation is one in which
all members execute a request in parallel,
presumably arriving at identical results and
changing replicated data in identical ways. A
third option is a subdivided computation, in which
each member performs part of a request, with the
collected outcome presented to the caller.
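One way to picture the coordinator-cohort scheme is as a rule that every member can evaluate locally and that yields the same answer everywhere, given the same group view. The rule below is an assumption made for illustration and is not taken from ISIS: pick_coordinator is a hypothetical function that starts at the rank given by the request id modulo the group size and skips members known to have failed. Different requests land on different ranks, which gives the load sharing described above, and a cohort takes over when the first choice is down.

```c
/* Hypothetical rule, not taken from ISIS: choose the coordinator for
   a request by mapping the request id onto the ranks of the current
   view, then skipping members known to have failed. alive[r] is
   nonzero if the member with rank r is operational. Returns the rank
   of the coordinator, or -1 if no member is alive. */
int pick_coordinator(unsigned request_id, const int *alive, int nmembers)
{
    int i;

    if (nmembers <= 0)
        return -1;
    for (i = 0; i < nmembers; i++) {
        int r = (int)((request_id + (unsigned)i) % (unsigned)nmembers);
        if (alive[r])
            return r;
    }
    return -1;
}
```

The rule is only safe if all members agree on the view and on which members have failed. That agreement is exactly what a consistent, identically ordered sequence of membership changes provides.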

Any of these methods can be programmed with one or
two subroutine calls to ISIS. In addition, callers
can transmit a request to a subset of the group
members, rather than broadcast to all, if only a
single response is needed.

Watching and Monitoring 

ISIS provides ways to monitor changes to group 
membership and to watch individual processes 
for failure. The system ensures that if any process 
senses a failure, all interested processes will ob- 
serve the same event. If the failure is transient, the 
“failed” process will be forced to rejoin the sys- 
tem, discarding or retaining and updating its in- 
ternal state at the programmer’s option. 

The toolkit contains other facilities, including 
an automated program-restart facility that oper- 
ates after crashes and onsite recovery, a news 
facility for a program-level analog to the network- 
news service, and an extensive log-based recovery 
mechanism. This facility is now being extended to 
allow an operational group of processes to log 
information for a process that is inaccessible or 
down, as an alternative to doing a large state 
transfer when it recovers. 

A Distributed Algorithm for 
Subdividing a Task 

In the ticker example above, recall the problem of
subdividing work among a set of tickers. To solve
this problem, we can program the tickers to
monitor group membership, noting the number of
members (nmembers) and their relative ranking in
the list (rank). Given that all members see the
same sequence of group views, dividing the
alphabet into nmembers parts is straightforward,
as is assigning responsibility for the ith part to
the two processes with ranks equal to i and
(i + 1) mod nmembers.
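The assignment rule just described can be written down directly. The sketch below is one possible reading of it, with hypothetical function names (part_range, handles) and an assumed way of slicing the alphabet into nearly equal contiguous parts; the article does not say how the parts are sized. Part i is handled by the members with ranks i and (i + 1) mod nmembers, so the member with a given rank covers its own part plus the part of the rank just before it.

```c
/* Letter range [*low, *high] of the i-th of nparts slices of 'A'..'Z'.
   Slices are contiguous, cover the whole alphabet, and differ in size
   by at most one letter. Assumes ASCII and 1 <= nparts <= 26. */
void part_range(int i, int nparts, char *low, char *high)
{
    *low  = (char)('A' + (i * 26) / nparts);
    *high = (char)('A' + ((i + 1) * 26) / nparts - 1);
}

/* Returns 1 if the member with the given rank is responsible for
   stocks starting with letter c, either as primary (part == rank) or
   as secondary (the part of the preceding rank), and 0 otherwise. */
int handles(int rank, int nmembers, char c)
{
    char lo, hi;
    int secondary = (rank + nmembers - 1) % nmembers;

    part_range(rank, nmembers, &lo, &hi);
    if (c >= lo && c <= hi)
        return 1;
    part_range(secondary, nmembers, &lo, &hi);
    return c >= lo && c <= hi;
}
```

With three members the parts are A-H, I-Q, and R-Z, and every letter, including I, is covered by exactly two members. Because each member computes its ranges from the shared view alone, no messages are needed to agree on the assignment.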

Each time membership changes, the work assignment
must be recomputed, raising the question of how to
synchronize the ticker input stream with the
membership changes. Obviously, if the input stream
is obtained from the Dow Jones wire service, it
will not arrive in the form of ISIS broadcasts.
Consequently, if one process switches before some
other process does, a gap may result during which
coverage of the stocks is incomplete. One way to
solve this problem is for processes to operate
briefly under two rules simultaneously: the old
rule and the new one. Meanwhile, a distinguished
process (say the one with rank 0) polls the group
to confirm that all members have actually started
using the new ranking. When this has occurred, a
broadcast is transmitted to inform group members
that they can stop using the old ranking. Briefly,
clients will have received as many as four copies
of some quotes, but none will be missed.

Figure 11 shows a fragment of the corresponding
code as it might be implemented in ISIS. Assume
that ticker_mon was specified in pg_join


Summer 1989 


SunTechnology 99 



Figure 10. Ticker process in which 
clients interested in Sun quotes are 
represented by a process group. 


new_sun_quote(...)
{
    address gaddr;

    gaddr = pg_lookup("@workstations:/stocks/sun");
    if (!addr_isnull(gaddr))
        bcast(gaddr, NEW_QUOTE, "%s", "SUN", ..., 0);
}


Figure 11. Program to distribute stock quotes. 


#define POLL   1
#define SWITCH 2

/* This ticker's primary and secondary responsibilities */
char PrimLow, PrimHigh, SecLow, SecHigh;

/* If non-zero, we are switching to new values */
char OldPrimLow, OldPrimHigh, OldSecLow, OldSecHigh;

#define between(c, low, high) (c >= low && c <= high)

main()
{
    isis_entry(poll, POLL, "poll a member");
    isis_entry(switchover, SWITCH, "switchover done");
    pg_join("@NYC:tickers", PG_MONITOR, ticker_mon, ...);
    /* Create "wire monitoring" task */
    t_fork(watch_wire);
    isis_mainloop();
}

/* Read quotes from the ticker, as if it were a keyboard */
watch_wire()
{
    while (TRUE)
    {
        /* Get a new quote from the wire, disseminate */
        read_quote(&stock, &date, &time, &price);
        got_quote(stock, date, time, price);
    }
}

/* On receiving a new quote, broadcast it if I am
   responsible */
got_quote(stock, date, time, price)
char *stock, *date;
int time, price;
{
    register c = *stock;

    if (between(c, PrimLow, PrimHigh) ||
        between(c, SecLow, SecHigh) ||
        (OldPrimLow && (between(c, OldPrimLow, OldPrimHigh) ||
        ...)))
    {
        /* bcast if I should disseminate this quote */
        address gaddr = pg_lookup(stock);
        if (!addr_isnull(gaddr))
            bcast(gaddr, NEW_QUOTE, "%s,%s,%d,%d",
                ..., 0);
    }
}

/* Reconfigure after group membership changes */
ticker_mon(gv)
groupview *gv;
{
    /* Temporarily use both work decompositions */
    OldPrimLow = PrimLow; OldPrimHigh = PrimHigh;
    PrimLow = 'A' + 26/gv->gv_nmembers * my_rank(gv);
    PrimHigh = 'A' +
        26/gv->gv_nmembers * mod(1+my_rank(gv),
        gv->gv_nmembers) - 1;
    SecLow = etc;

    if (my_rank(gv) == 0) {
        /* Poll group members (including self) */
        bcast(tickgroup, POLL, ALL, "");
        /* All can switch over */
        bcast(tickgroup, SWITCH, "", 0);
    }
}

/* Send an empty reply; what counts is that I got
   the message */
poll(mp)
message *mp;
{
    reply(mp, "");
}

/* Switchover is completed. Stop monitoring stocks from
   old view */
switchover(mp)
message *mp;
{
    OldPrimLow = OldPrimHigh = ... = 0;
}

as the routine monitoring the state of the ticker
group. The method is slightly simplified for
presentation. For example, it should be extended to
deal with multiple failures by having a list of old
mappings, instead of just one. It would also be
desirable to cache group addresses.
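The extension to a list of old mappings might look like the following sketch. It is an assumption-laden illustration rather than part of the ISIS Toolkit: the types mapping and ticker_state, the bound MAX_OLD, and the functions install_view, retire_before, and responsible are all hypothetical names. Each mapping is tagged with the view number it was computed for, so a completed switchover retires only the mappings older than that view.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_OLD 8   /* hypothetical bound on overlapping reconfigurations */

typedef struct { char low, high; } letter_range;          /* inclusive */
typedef struct { letter_range prim, sec; int viewid; } mapping;

typedef struct {
    mapping current;        /* rule for the newest view seen */
    mapping old[MAX_OLD];   /* rules from views not yet retired */
    int nold;
} ticker_state;

static bool in_range(char c, letter_range r)
{
    return c >= r.low && c <= r.high;
}

static bool covers(const mapping *m, char c)
{
    return in_range(c, m->prim) || in_range(c, m->sec);
}

/* A new view arrived: remember the current rule, install the new one. */
void install_view(ticker_state *t, mapping next)
{
    assert(t->nold < MAX_OLD);
    t->old[t->nold++] = t->current;
    t->current = next;
}

/* The switchover for view `viewid` finished: every member now uses a
 * rule at least that new, so older rules can be dropped. */
void retire_before(ticker_state *t, int viewid)
{
    int i, kept = 0;
    for (i = 0; i < t->nold; i++)
        if (t->old[i].viewid >= viewid)
            t->old[kept++] = t->old[i];
    t->nold = kept;
}

/* Disseminate a quote if any rule still in force covers its letter. */
bool responsible(const ticker_state *t, char c)
{
    int i;
    if (covers(&t->current, c))
        return true;
    for (i = 0; i < t->nold; i++)
        if (covers(&t->old[i], c))
            return true;
    return false;
}
```

The design choice here is to key retirement on the view number carried by the switchover, so that a late switchover message for an earlier reconfiguration cannot discard a mapping that a later reconfiguration still needs.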

The idea of virtual synchrony simplifies the
switchover logic. Because the process doing the
polling operation has already observed the new
view, all processes that receive a poll message
will also have seen it. Thus the replies they send
need not carry any data. Because the sender waits
for replies from everyone, the sender can now meet
our objective of executing the second-stage
broadcast (the one to the SWITCH entry point)
when all the processes have definitely received
the new view.

The algorithm in Figure 11 cannot tolerate a
second failure that occurs before it switches over
to the new work assignment. The problem is that the
group could start a second reconfiguration while
the first is still underway. The switchover message
for the first failure would then be interpreted as
if it applied to the second failure. One
possibility is simply to accept the small risk that
a few quotes will be lost if two failures occur in
rapid succession. The alternative is to change
ticker_mon to eliminate this problem by checking
to see if the view sequence number for the group
changed while the POLL was occurring. ISIS
maintains view sequence numbers, which increment
with each view change. If the number does change,
the SWITCH message should not be sent. Some other
task, which is also running ticker_mon, will be
responsible for the new reconfiguration. 12 Figure
12 shows a version that tolerates arbitrary
sequences of failures.

Although the logic behind this change is subtle,
keep in mind that without ISIS, the same problem
is nearly impossible to solve in fault-tolerant
fashion. Also, much of the complexity stems from
ticker_mon's being a lightweight task that can
be reentered while it is asleep in bcast, a
potential problem in any system with lightweight
tasks. ISIS may not make fault-tolerant
reconfiguration easy, but it does make arriving at
a concise, correct solution feasible.

The META Operating System 

Mechanisms such as remote procedure calls and 
the ISIS tools act as a “glue” programmers can
use to build distributed programs. With the ISIS 
Toolkit, programmers can concentrate on prob- 
lems such as how certain data structures should 
be shared and how control should be distributed 
and ignore problems such as how failures can be 
detected consistently or how updates to a replicat- 
ed data structure can be made atomic. In a formal 
sense, ISIS does not allow programmers to write 
more powerful programs, but it does make the 


ticker_mon(gv)
groupview *gv;
{
    int viewid = gv->gv_viewid;

    /* Record old values if reconfiguration is not already underway */
    if (OldPrimLow == 0)
        { OldPrimLow = PrimLow; OldPrimHigh = PrimHigh; }
    PrimLow = 'A' + 26/gv->gv_nmembers*my_rank(gv);
    PrimHigh = 'A' + 26/gv->gv_nmembers*mod(1+my_rank(gv), gv->gv_nmembers)-1;
    ... etc

    if (my_rank(gv) == 0) {
        bcast(tickgroup, POLL, ALL, "");
        /* Check to make sure view didn't change while waiting for replies */
        if (gv->gv_viewid == viewid)
            bcast(tickgroup, SWITCH, "", 0);
    }
}


Figure 12. A fault-tolerant version of the ticker_mon procedure.

task much easier, and the chances of the
programmer’s writing a correct fault-tolerant
distributed program are much higher than if he
were to use simple remote procedure calls.

Distributed systems, however, are not merely
distributed programs. Applying the term system to
a set of programs, distributed or centralized,
implies that the interconnections between the
programs are nontrivial. Moreover, a distributed
system must deal with a complex and changing
runtime environment. For example, consider the
stock brokerage system. We may want to monitor
the news wires for keywords and have
market-forecasting programs adapt to major world
news events. When such an event occurs, the entire
set of programs run may differ from the normal
case. That is, a single external event can have
sweeping implications that span most of the
distributed system. To solve this kind of problem,
especially if our system may have to deal with
many such events in differing ways, we again need
glue. Here, however, the glue permits us to do
programming in the large.

In essence, a distributed system consists of a set 
of programs, which may be distributed ones, and 
a form of glue that controls and interconnects 
them. Typically, the system is mediated by the 
operating system. In current distributed systems, 
the varieties of glue consist of network services such 
as fileservers, electronic-mail servers, name- 
servers, lock managers, and even ticker services. 

Unfortunately, network services are not everything
that’s needed. The programmer is still forced to
worry about basic problems of how to monitor for
an event, how to alert a program that an event has
occurred, or how to ensure that the failure of a
server won’t cripple the system. Network services
lack mechanisms analogous to those in the ISIS
Toolkit but oriented toward programming in the
large. As a result, program interconnection in
current distributed operating systems is
unsophisticated at best. It is hardly surprising
that current distributed systems provide little
more than a collection of local operating systems
supporting remote execution, load balancing, and
transparent location of files.

Like ISIS, the META project provides a better 
glue. Our objective is to build a layer of operating- 
system-like software that will span many ma- 
chines in a network: on the order of hundreds to 
thousands of workstations. META will not replace 
the underlying operating systems but instead will 
offer the higher-level glue to allow large fault- 
tolerant distributed applications to be easily inter- 
connected and controlled. 

Currently META consists of three major pieces: 

1. A distributed, highly available filesystem that
is built from standard network filesystems and
supports the standard NFS file-access protocols.

2. An event manager that allows programs to 
interact by using fault-tolerant events. 

3. An event monitor that interprets policy rules 
written in a system-independent language. 

META is in an early stage of development and 
experimentation. The structure will change and 
expand with time and experience. 

The META File System 

The filesystem is the pivotal component of a dis- 
tributed system. Virtually all sharing of persistent 
data occurs through the filesystem, and much of 
the performance of a distributed system is deter- 
mined by the performance of the filesystem. As a 
result, the majority of the current research taking 
place in distributed filesystems has focused on 
increasing performance. 15 

Performance is not the only property needed
from a distributed filesystem. It is also
important to ensure the availability of key files;
otherwise, the failure of a server can lock an
application or the workstation itself. Key files
include relatively static ones such as
system-configuration files and dynamic ones such
as log files and text files. Moreover, the
distributed filesystem should provide the
structure needed to interconnect perhaps hundreds
of filesystems, including slow local ones, large
shared repositories, and special-purpose filelike
devices, into a coherent whole. A distributed
filesystem should also be easy to manage; current
ones require too much effort on the part of the
system administrators who partition disks and
schedule backup procedures.

The META File System consists of two parts: a
distributed control service that uses file
replication to give high availability and a set of
data repositories. The control service implements
both replication and the distributed filesystem
abstractions needed to deal with large-scale file
management. The repositories consist of
commercially available file servers that can also
be used to store nonreplicated files.

The current prototype uses a simple file
replication algorithm and stores files on NFS
servers. 14 It is completely transparent to both
clients and servers. Its structure is shown in
Figure 13. The intermediate agents implement the
META File System control service. They guarantee
that all updates are ordered with respect to other
updates, that all available replicas are written,
and that the crash and later recovery of an NFS
server makes the replicas on that server current.
The replication incurs a cost, but UNIX caching
hides the majority of it.
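The three guarantees can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not the META replication algorithm and not the NFS protocol: the types replica and replicated_file, the constants NREPLICAS and DATA_MAX, and the functions rf_write and rf_recover are hypothetical, whole-file contents stand in for individual updates, and a single agent is assumed to assign the update order.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NREPLICAS 3
#define DATA_MAX  64

typedef struct {
    bool up;                 /* is the server holding this replica reachable? */
    unsigned version;        /* sequence number of the last update applied */
    char data[DATA_MAX];
} replica;

typedef struct {
    replica r[NREPLICAS];
    unsigned next_version;   /* the agent's total order on updates */
} replicated_file;

/* Apply one update, in order, to every available replica.
 * Returns the number of replicas written. */
int rf_write(replicated_file *f, const char *contents)
{
    int i, written = 0;
    unsigned v;
    assert(strlen(contents) < DATA_MAX);
    v = ++f->next_version;
    for (i = 0; i < NREPLICAS; i++) {
        if (!f->r[i].up)
            continue;        /* crashed server: skipped, now stale */
        strcpy(f->r[i].data, contents);
        f->r[i].version = v;
        written++;
    }
    return written;
}

/* Bring recovered replica i up to date from the most current
 * available replica, then mark it available again. */
void rf_recover(replicated_file *f, int i)
{
    int j, best = -1;
    for (j = 0; j < NREPLICAS; j++)
        if (j != i && f->r[j].up &&
            (best < 0 || f->r[j].version > f->r[best].version))
            best = j;
    if (best >= 0 && f->r[best].version > f->r[i].version) {
        strcpy(f->r[i].data, f->r[best].data);
        f->r[i].version = f->r[best].version;
    }
    f->r[i].up = true;
}
```

A replica that was down during a write keeps its old contents until rf_recover copies the newest version back, which mirrors the statement that recovery makes the replicas on that server current.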

The META Event Manager 

A filesystem lets programs share data, but it is not 
very useful for synchronization of programs. In 
order for programs to interact, they need a way to 
signal and await conditions, at a high level that 
can span large numbers of machines or programs. 

For example, consider a utility called PMake,
which is a distributed version of the UNIX make
program. PMake needs to locate a set of machines
that are lightly loaded, have the correct
construction tools available, and have enough
resources to complete a set of construction steps
in a reasonable amount of time. Lacking META,
PMake must solve this problem by talking with a
name server to locate a set of possible machines,
a lock manager to tentatively allocate the
machines, and rstatd to determine if the machine
is lightly loaded. There simply isn’t any easy way
to decide if machines have the right tools or
filesystems mounted. If these or other properties
are taken into account, either an existing service
would have to be expanded or a new one written.

The META Event Manager makes writing programs
such as PMake easier. Like the META File System,
the Event Manager is primarily a distributed
control program; it translates the needs of
programs awaiting general events or needing
resources into specific requests on existing
services. PMake issues a description of its needs
to the Event Manager and simply waits for the
Event Manager to satisfy them. The Event Manager
includes a generic “server” that allows new types
of queries to be added easily. The architecture of
the Event Manager is shown in Figure 14.

As seen by a client, the Event Manager has a set
of tables and functions that represent,
respectively, static and dynamic properties of the
system. The client can query these tables with a
simple procedural interface. At this level, the
Event Manager resembles a temporal database with
highly available tables. 15

Figure 13. META File System architecture. A
substantially extended version should be
operational in mid-1989.

Figure 14. META Event Manager architecture.


Internally, the Event Manager maintains the 
tables and functions both as private tables and as 
queries to existing services. The Event Manager 
uses replication for fault tolerance. This manager 
can take advantage of any replication that the 
existing services already supply. For example, 
suppose the Event Manager maintains a function 
representing the temperature inside a reaction 
vessel. For this value to be available, some replica- 
tion must exist; otherwise, the single failure of the 
only temperature sensor will make the function 
inaccessible. 

Two methods are available for doing the
replication: adding another temperature sensor, or
adding a pressure sensor and using Boyle’s law. By
having programs read these values indirectly
through the Event Manager, either physical value
can be translated into the desired logical one by
a single computation (supplied to us by the
programmer who defines “temperatures”).
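A logical function of this kind might be sketched as follows. The names reading, temp_from_pressure, and logical_temperature are hypothetical and are not Event Manager interfaces. The article attributes the conversion to Boyle’s law; this sketch assumes a sealed, fixed-volume vessel holding an ideal gas, in which pressure is proportional to absolute temperature, and it assumes a calibration point (p_ref, t_ref) is known.

```c
#include <assert.h>
#include <stdbool.h>

/* A sensor reading that may be unavailable because the sensor failed. */
typedef struct { bool ok; double value; } reading;

/* Fixed volume, ideal gas: T = t_ref * p / p_ref.
 * Temperatures are in kelvin; p and p_ref share any one pressure unit. */
double temp_from_pressure(double p, double p_ref, double t_ref)
{
    assert(p_ref > 0.0);
    return t_ref * p / p_ref;
}

/* Logical temperature: prefer the thermometer, fall back to the
 * pressure sensor, and report failure only if both are unavailable. */
reading logical_temperature(reading thermometer, reading pressure,
                            double p_ref, double t_ref)
{
    reading out = { false, 0.0 };
    if (thermometer.ok) {
        out.ok = true;
        out.value = thermometer.value;
    } else if (pressure.ok) {
        out.ok = true;
        out.value = temp_from_pressure(pressure.value, p_ref, t_ref);
    }
    return out;
}
```

A client reading the logical value never needs to know which physical sensor supplied it, which is the point of routing the read through one place.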

New services can be added by a method similar
to RPC stub generation. When creating a new
service, the implementor writes an interface
describing the information being sensed and the
kind of sensor available (for example, edge
sensitive or polled). A stub compiler generates a
sensor stub that calls the sensor, a monitor stub
the Event Manager uses to access the sensor, and
location information that allows the Event Manager
to bind to the sensor.

The META Event Monitor 

PMake might want to allocate five Sun-4
workstations, each with a low load and 16 MB of
memory; however, there may be other restrictions
on the workstations chosen. For example, if a
workstation is slated for maintenance in the next
hour, or the network to which the workstation is
attached will be used for several large file
archives, the workstation should be considered
temporarily unavailable. The programmer cannot
know all the restrictions ahead of time, so the
Event Manager denotes an abstract property of the
workstation: its availability.

How will an event be generated? Rather than 
write a separate program monitoring each possi- 
ble condition, the META Event Monitor keeps a 
simple rule base specifying the conditions that lead 
to a workstation’s unavailability. Here, one rule 
might be: 

DURING StartPM(machine) - 1 hour TO EndPM(machine)
THEN Unavailable(machine)
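Read as a predicate over time, the rule above amounts to an interval test. The following sketch is only an illustration of that reading, not the Event Monitor’s rule language or evaluator: the type pm_schedule and the function unavailable are hypothetical, and times are assumed to be seconds on a common clock.

```c
#include <assert.h>
#include <stdbool.h>

#define HOUR 3600L   /* seconds */

/* Hypothetical record of a machine's maintenance window. */
typedef struct { long start_pm, end_pm; } pm_schedule;

/* True from one hour before maintenance starts until it ends,
 * with both endpoints treated as inside the interval. */
bool unavailable(const pm_schedule *m, long now)
{
    assert(m->start_pm <= m->end_pm);
    return now >= m->start_pm - HOUR && now <= m->end_pm;
}
```

Keeping the test in one declarative rule means a change in policy, such as a two-hour lead time, touches only the rule and none of the programs that ask whether a machine is available.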

The META Event Monitor is not meant only for
monitoring network conditions. It is useful for
specifying any policy rules that can be translated
into basic actions that other programs will
follow. A more elaborate example is a hospital
system containing several distributed programs,
such as programs that locate a doctor and send
emergency messages, sensor systems that monitor
patients, programs that schedule operating rooms,
and systems that prescribe drugs. Policy rules can
tie these programs together, for example, to
alert a doctor if a patient has an adverse reaction
to a drug, to assemble the necessary resources for
an emergency admission of a patient, or to locate
a replacement doctor if the primary one is not
available. The policy rules can be altered as the
resources of the hospital change, but the programs
supplying the mechanisms need not change.

Using the Event Monitor offers several advan- 
tages. It permits policy rules to be separated from 
programs and specified in an explicit and concise 
manner and builds a special-purpose program to 
implement each rule. The rules can evolve with- 
out the necessity of extensive system changes or 
reprogramming. 

The Event Monitor itself is a distributed pro- 
gram that behaves somewhat like an expert sys- 
tem. It maintains a set of rules, which it continu- 
ously and concurrently evaluates. The rules are 
written in a realtime version of interval logic 16 and 
are evaluated against the META Event Manager’s
tables and functions. 

Availability 

ISIS is publicly available in the United States and
subject to some minor export restrictions in most 
other countries. Commercial support for ISIS is 
available from ISIS Distributed Systems, Inc. 
Source is provided with the system, which can 
currently be used to interconnect Sun, HP, DEC, 
Apollo, and Gould computer systems running 
variants of Berkeley UNIX. A MACH port has 
been completed, and ports to AIX and possibly 
VMS, as well as interfaces to the toolkit from 
other languages, are planned. The META Operat- 
ing System is still under development; plans to 
distribute it have not yet been established.